An Architecture for the Aggregation and Analysis 
of Scholarly Usage Data. 



Johan Bollen 

Digital Library Research & Prototyping Team 
Los Alamos National Laboratory 
Los Alamos 
NM, 87545 

jbollen@lanl.gov 



Herbert Van de Sompel 

Digital Library Research & Prototyping Team 
Los Alamos National Laboratory 
Los Alamos 
NM, 87545 

herbertv@lanl.gov 



ABSTRACT 

Although recording of usage data is common in scholarly in- 
formation services, its exploitation for the creation of value- 
added services remains limited due to concerns regarding, 
among others, user privacy, data validity, and the lack of 
accepted standards for the representation, sharing and ag- 
gregation of usage data. This paper presents a technical, 
standards-based architecture for sharing usage information, 
which we have designed and implemented. In this architec- 
ture, OpenURL-compliant linking servers aggregate usage 
information of a specific user community as it navigates the 
distributed information environment that it has access to. 
This usage information is made OAI-PMH harvestable so 
that usage information exposed by many linking servers can 
be aggregated to facilitate the creation of value-added ser- 
vices with a reach beyond that of a single community or a 
single information service. This paper also discusses issues 
that were encountered when implementing the proposed ap- 
proach, and it presents preliminary results obtained from 
analyzing a usage data set containing about 3,500,000 re- 
quests aggregated by a federation of linking servers at the 
California State University system over a 20 month period. 

Categories and Subject Descriptors 

H.3.7 [Information Storage and Retrieval]: Digital Libraries; 
H.3.3 [Information Storage and Retrieval]: Information Search 
and Retrieval 

General Terms 

Digital Libraries, usage data, architecture, standards, ag- 
gregation, analysis, OAI-PMH, OpenURL 

1. INTRODUCTION 

Applications of usage data have altered the landscape of 
commercial information services. A user-driven revolution 
is underway in which end-user services are no longer solely 
based on top-down design decisions, but have come to promi- 
nently include the analysis of user actions and preferences. 
The introduction of a wide range of successful evaluation and 
recommendation services which rely in some shape or form 
on the analysis of usage data has fundamentally changed 
how users interact with online services and is having a pro- 
found effect on their behavior and preferences [1]. 

It is natural that users of scholarly information services 
expect the same kind of functionality that is now common 
in the Web-based retail domain. But the applicability of 
usage data in the scholarly realm goes beyond that of rec- 
ommendation services, and also includes collection develop- 
ment support [4], derivation of metrics to assess the quality 
and impact of scholarly communication units [3], and trend 
analysis [14, 5]. Especially for the latter two applications, 
the large-scale aggregation of usage data across information 
services and communities is essential, as the derivation of 
global measures of impact and the identification of global 
trends requires usage data that is representative at a global 
scale. 

For that reason, a number of recent initiatives have fo- 
cused on making usage data of scholarly information ser- 
vices available and on promoting its application in the do- 
main of scholarly support services. An XML-based format 
for the representation of Digital Library usage data was pro- 
posed by Goncalves (2002) [11], and a log archive was cre- 
ated to which usage logs could be posted, and later, down- 
loaded. Unfortunately, neither the proposed format nor the 
log archive has been widely adopted. An OAI-PMH based 
solution for the representation and aggregation of usage data 
was proposed by Van de Sompel (2002) [10]. Although, in 
hindsight, the proposed format for the exposure of usage 
data seems to have been inspired by a specific application, 
that effort still provides the foundation for the work reported 
here. The COUNTER project^ has led to the specification 
of a format for recording journal-level usage statistics that 
information services can use to reliably share usage data 
with subscribing institutions. The SUSHI project^ aims to 
automate the exchange and aggregation of COUNTER us- 
age statistics by means of a dedicated set of web services 



^COUNTER: Counting Online Usage of NeTworked Electronic 
Resources, http://projectcounter.org/ 

^SUSHI: Standardized Usage Statistics Harvesting Initiative, 
http://www.library.cornell.edu/cts/elicensestudy/ermi2/sushi/ 



and protocols. And, the IRS project^ focuses on usage in- 
formation recorded by Institutional Repositories (IR) and 
aims to identify which parameters should be recorded uni- 
formly across IRs, how such parameters can be derived from 
IRs and how the results can be shared for aggregation. 

The remainder of this paper is organized as follows: Sec- 
tion 2 discusses the basic architecture of the proposed so- 
lution for the aggregation of usage data, and it details the 
manner in which the solution builds on accepted standards; 
to illustrate the potential of the proposed solution, Section 3 
provides insights in prototype services that were developed 
on the basis of a large aggregated usage data collection; and 
Section 4 discusses issues that were encountered when 
implementing the proposed solution. 

2. ARCHITECTURE 

This section outlines a technical, standards-based archi- 
tecture for recording, representing, sharing and mining us- 
age information of scholarly information services. OpenURL- 
compliant linking servers play an important role in the pro- 
posed solution, as they naturally aggregate the navigations 
of a specific user community across the distributed informa- 
tion services that are available to them. As will be discussed 
in detail in the remainder of this section, the following four 
phases can be distinguished in the proposed log harvesting 
architecture: 

1. Intra-Institutional Aggregation of Usage Data 

Usage events generated by users of a specific institu- 
tion as they navigate their distributed scholarly infor- 
mation environment are recorded by the institutional 
linking server. 

2. Exposure of Institutional Usage Data 

The institutional usage data recorded by the linking server 
is exposed through an OAI-PMH-compliant log repository 
in which each event is represented as an XML 
ContextObject. 

3. Inter-Institutional Aggregation of Usage Data 

OAI-PMH harvesters collect the usage data from a variety 
of OAI-PMH-compliant institutional log repositories. 

4. Service Provision 

Value-added services are implemented based on the 
aggregated usage data collection. 

2.1 Intra-Institutional Aggregation of Usage 
Data 

Starting around 2001, scholarly information services have 
in great numbers begun to support the concept of context- 
sensitive services [9] by implementing the OpenURL 0.1 
specification*, while academic and research libraries have 
increasingly installed the linking servers that are required 
to provide localized services to their user base. 

Fig. 1 shows the information environment of a particular 
user community as it could exist for an academic institution. 
It shows the many distributed scholarly information services 
that are accessible to that user community, and it shows an 

^IRS: Interoperable Repository Statistics, 
http://irs.eprints.org/ 
institutional linking server as a central hub in this 
information environment. In the context-sensitive service 
environment enabled by OpenURL and linking servers, an 
information service, such as Google Scholar, inserts an OpenURL for 
every reference to a scholarly work that it presents to a user, 
for example, as a search result. This OpenURL is an HTTP 
GET request carrying metadata that are essential to identify 
the referenced work. It points to the linking server of the 
users' institution which contains a rule engine powered by 
a knowledge database that is typically maintained by the 
user's institutional library. Given an incoming OpenURL 
request, and through consultation of its localized rules and 
knowledge database, the linking server returns a list of ser- 
vices pertaining to the referenced work to the user. Those 
services typically point into other information services avail- 
able in the users' distributed information environment, such 
as Ingenta, ISI, publisher sites and full-text databases, as 
shown in Fig. 1. 
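
The request described above can be sketched in a few lines of Python. This is only an illustration: the linking server address, the source identifier and the metadata values are hypothetical (borrowed from the examples in this paper), and the key names (sid, genre, atitle, title, id) are assumed to follow the OpenURL 0.1 draft.

```python
from urllib.parse import urlencode


def build_openurl(resolver_base, sid, metadata):
    """Return an OpenURL 0.1-style HTTP GET URL aimed at a linking server."""
    params = {"sid": sid}        # identifies the referring information service
    params.update(metadata)      # metadata identifying the referenced work
    return resolver_base + "?" + urlencode(params)


url = build_openurl(
    "http://sfx.example.org/menu",   # hypothetical institutional linking server
    "elsevier.com:scopus",           # hypothetical source identifier
    {
        "genre": "article",
        "atitle": "Toward alternative metrics of journal impact",
        "title": "Information Processing and Management",
        "id": "doi:10.1016/j.ipm.2005.03.024",
    },
)
print(url)
```

An information service would embed such a URL next to each reference it displays; following it hands the description of the work to the user's linking server.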




Figure 1: Linking servers are well-positioned to cap- 
ture usage data. 

The central position held by a linking server in the dis- 
tributed scholarly information environment of a specific user 
community makes it particularly appealing as the source of 
usage information. Indeed, a linking server logs the OpenURL 
requests of all users of the community originating from many 
of the available distributed information sources. As such 
it de-facto aggregates usage information at the level of the 
community, and internally represents the usage information 
in a normalized manner. 

A particular and unique advantage of linking server usage 
data is the ability to track sequences of requests across 
a variety of information services. Indeed, in Fig. 1, the 
linking server is aware of the fact that the user was navigat- 
ing Google Scholar, and requested services from the link- 
ing server pertaining to a referenced work. The linking 
server also knows which of those services was chosen by the 
user. And, if that chosen service led into another 
OpenURL-compliant information service, e.g. Ingenta in Fig. 1, and if 
the user again requested services from the linking server per- 
taining to a work referenced there, the linking server would 
again be aware. Such a sequence of requests can be recorded 
by the linking server and hence exploited by click-stream 



based methods of log analysis [15, 16] to reveal temporal 
trends in user behavior and to recommend items which are 
often accessed in a particular sequence. Such temporal pat- 
terns would be very difficult, if not impossible, to reconstruct 
from the aggregation of the logs obtained from each of the 
individual information services in the users' environment. 
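
To make the idea of request sequences concrete, the following sketch counts how often one item is requested directly after another by the same Requester. It is not the click-stream method of [15, 16]; the event tuples, the function name and the sample identifiers are hypothetical, and timestamps are assumed to be ISO 8601 strings in a single format so that they sort chronologically.

```python
from collections import Counter, defaultdict


def consecutive_pairs(events):
    """Count (first, second) Referent pairs requested back to back.

    `events` is an iterable of (requester, timestamp, referent) tuples.
    """
    by_requester = defaultdict(list)
    for requester, timestamp, referent in events:
        by_requester[requester].append((timestamp, referent))

    pairs = Counter()
    for history in by_requester.values():
        history.sort()  # chronological order per Requester
        for (_, first), (_, second) in zip(history, history[1:]):
            if first != second:
                pairs[(first, second)] += 1
    return pairs


events = [
    ("u1", "2005-11-11T17:45:08Z", "doi:A"),
    ("u2", "2005-11-11T17:46:00Z", "doi:A"),
    ("u1", "2005-11-11T17:50:10Z", "doi:B"),
    ("u2", "2005-11-11T17:52:30Z", "doi:B"),
    ("u1", "2005-11-11T17:55:00Z", "doi:C"),
]
pairs = consecutive_pairs(events)
print(pairs.most_common())
```

Because the linking server sees every request of a user regardless of the originating service, such sequences can be derived from a single log.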

2.2 The OpenURL ContextObject for the in- 
teroperable representation of usage data 

Each brand of available OpenURL-compliant linking server 
probably stores its usage information in a proprietary man- 
ner. However, when the goal is to share usage information 
across a federation of heterogeneous linking servers, support 
for a common representation format becomes important. 

The NISO/ANSI Z39.88-2004 Standard "The OpenURL 
Framework for Context-Sensitive Services"^ is a powerful 
generalization of the context-sensitive service concepts that 
were at the basis of the definition of OpenURL 0.1. At 
the core of the standard is the notion of the ContextOb- 
ject. The ContextObject is an abstract data structure that 
encapsulates six entities that are involved in the fulfillment 
of a context-sensitive service request. The structure of the 
OpenURL ContextObject is shown in Fig. 2. At the core 
of the ContextObject is the Referent; it is the actual sub- 
ject of the service request that the ContextObject encodes, 
i.e. the service request pertains to the Referent. The Re- 
quester is the agent that requests the service pertaining to 
the Referent. The ServiceType specifies the type of service 
that is requested. The ContextObject furthermore contains 
the ReferringEntity, i.e. the entity that references the 
Referent, the Resolver which is the target of a service request, 
and the Referrer which is the system that generated the 
ContextObject. 

Each entity of the ContextObject can be described by 
means of a combination of identifiers, metadata and private 
data. To allow for a controlled deployment of applications 
based on the OpenURL Standard, the OpenURL Registry^ 
provides the capability to register identifier namespaces and 
Metadata Formats that are used in OpenURL Applications. 
The abstract ContextObject data structure can be instanti- 
ated using different serializations, and both a Key/Encoded- 
Value pair serialization and an XML serialization have been 
defined as part of the NISO standardization effort. A service 
request pertaining to a Referent is achieved by transporting 
a serialized ContextObject with the Referent at its core to- 
wards a Resolver. 
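
As an informal illustration of this abstract data structure, the six entities and their descriptors could be modeled as follows. The class and field names are hypothetical and are not taken from the standard; treating every entity other than the Referent as optional is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """One ContextObject entity, described by identifiers, metadata, private data."""
    identifiers: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)
    private_data: Optional[str] = None


@dataclass
class ContextObject:
    """Abstract ContextObject: a Referent plus the entities surrounding it."""
    referent: Entity
    requester: Optional[Entity] = None
    service_type: Optional[Entity] = None
    referring_entity: Optional[Entity] = None
    resolver: Optional[Entity] = None
    referrer: Optional[Entity] = None


event = ContextObject(
    referent=Entity(identifiers=["info:doi/10.1016/j.ipm.2005.03.024"]),
    resolver=Entity(identifiers=["http://sfx.example.org/menu"]),
)
print(event)
```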

[Figure 2 diagram: the ContextObject comprises the entities 
Referent, ReferringEntity, Requester, ServiceType, Resolver 
and Referrer; each entity is described by Descriptors of type 
Identifier, By-Value Metadata, By-Reference Metadata or 
Private Data.] 



Figure 2: Structure of OpenURL ContextObject. 



In the perspective of the NISO/ANSI Z39.88-2004 Standard, 
linking servers as described above have become a special 
type of Resolver, namely a Resolver that supports a specific 
OpenURL Application known as the San Antonio Profile^. 
Also, according to the new standard, a service request 
targeted at a linking server is the transportation of a 
ContextObject with a description of the referenced work at 
its core (the Referent) towards the linking server (the Re- 
solver). Hence, since a ContextObject is the embodiment 
of a service request aimed at linking servers, the Contex- 
tObject also provides an appropriate data structure for the 
representation and sharing of usage information recorded by 
linking servers. 

In order to illustrate the mapping between a service re- 
quest issued to a linking server and the ContextObject data 
structure, it is worthwhile pointing out that each individ- 
ual usage event can, in essence, be described by a triplet 
consisting of: 

What The item for which the usage was recorded, e.g. a 
journal article. 

Who The originator of the event, e.g. the user. 

When The time at which the event occurred, i.e. the event's 
timestamp. 

In the proposed usage log representation technique, a us- 
age event is defined as an individual OpenURL-compliant 

service request targeted at a linking server. Such a usage 
event is represented as an individual ContextObject accord- 
ing to the mapping described in the remainder of this para- 
graph. The what and who components of the triplet can 
readily be mapped to the Referent and Requester entities of 
the ContextObject, respectively. Moreover, as can be seen 
in Table 1, the ContextObject allows for the inclusion of 
descriptions of other entities that are relevant to a service 
request and hence to the downstream exploitation of the 
represented usage data: the Resolver, the ServiceType, the 
Referrer, and the ReferringEntity. These mappings are in- 
dependent of the serialization format of the ContextObject. 
In order to include the when component of the triplet, a 
choice for the XML serialization of the ContextObject must 
be made as that provides an administrative information el- 
ement, namely the timestamp attribute, which indicates 
when the ContextObject was generated, which in the con- 
text of this application is the moment at which the service 
request was issued (see Table 2). Moreover, when aiming 
at the global sharing of usage information it is important 
to be able to unambiguously identify each event recorded 
by a linking server. To that end, each event represented 
by a ContextObject is accorded a globally unique UUID [9] 
by the linking server, which can be conveyed using another 
administrative element of the XML ContextObject, namely 
the identifier (see Table 2). 
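
The mapping of the what/who/when triplet onto an XML ContextObject can be sketched with Python's standard library. This is a minimal illustration, not the serialization code of any linking server: the function name is hypothetical, the namespace URI is assumed (namespace declarations are omitted in Table 2), and only identifier descriptors are emitted.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

CTX_NS = "info:ofi/fmt:xml:xsd:ctx"  # assumed namespace URI for the ctx prefix
ET.register_namespace("ctx", CTX_NS)


def event_to_context_object(what, who, when, event_id=None):
    """Represent one usage event (what, who, when) as an XML ContextObject."""
    event_id = event_id or "urn:UUID:" + str(uuid.uuid4())
    root = ET.Element(
        "{%s}context-object" % CTX_NS,
        {
            "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),  # when (UTC)
            "identifier": event_id,                            # global event ID
        },
    )
    # what -> Referent, who -> Requester
    for tag, value in (("referent", what), ("requester", who)):
        entity = ET.SubElement(root, "{%s}%s" % (CTX_NS, tag))
        ET.SubElement(entity, "{%s}identifier" % CTX_NS).text = value
    return root


when = datetime(2005, 11, 11, 17, 45, 8, tzinfo=timezone.utc)
ctx = event_to_context_object(
    "info:doi/10.1016/j.ipm.2005.03.024", "urn:ip:63.236.2.100", when
)
xml_text = ET.tostring(ctx, encoding="unicode")
print(xml_text)
```

Passing an existing `event_id` keeps the identifier stable when the same event is serialized more than once.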

The ability to express all the entities that pertain to a ser- 
vice request in a standard-based, self-contained data struc- 
ture is quite appealing in light of the need to share usage 
data across the boundaries of information services and com- 
munities. Because the proposed mappings are done with 
the aim of sharing usage information across a federation of 
linking servers, the choice for the XML ContextObject 
format seems logical instead of restrictive. Moreover, the XML 
ContextObject format allows entities of the ContextObject 
to be described using more than one Metadata Format, as 
such allowing for very expressive descriptions whenever pos- 
sible or appropriate. 



Entity            Mapped usage data 
----------------  --------------------------------------------- 
Referent          Item used, e.g. journal article 
ReferringEntity   Entity that referenced the Referent 
Requester         User identification 
ServiceType       Type of usage, e.g. "full-text request" 
Resolver          Linking server 
Referrer          Information system that generated the 
                  ContextObject, i.e. the system that the user 
                  was navigating when issuing a service request 
                  to the linking server 



Table 1: Mapping of usage data to the ContextOb- 
ject data structure 

Table 2 shows an example of linking server usage data 
encoded according to the described mapping; for brevity, XML 
Namespace declarations have been omitted: 

• The root context-object element has two attributes, 
the timestamp and the identifier, with semantics 
as described above. 

• The Referent is a journal article that is being described 
by both an identifier and by metadata. The identifier 
is a DOI identifier expressed as a URI following the 

"info" URI scheme^, while the metadata is compli- 
ant with the XML Metadata Format to describe jour- 
nal articles. This Metadata Format is registered in 
the OpenURL Registry^ along with its XML Schema 
definition^. The OpenURL Registry contains XML 
Metadata Format definitions for other types of schol- 
arly works including dissertations, books, and patents. 
Metadata Format definitions for other types of works, 
for example datasets, can be registered, and, for flex- 
ibility, the OpenURL Standard allows for the use of 
unregistered Metadata Formats. 

• The Requester is described by means of an identifier 
which in this example consists of the IP address of 
the user's computer represented using an ad-hoc URN 
scheme. Typically, the IP address is the only information 
that is available for a Requester. However, if more 
information, such as an EduPerson user record^, were 
available, it could be expressed using XML 
and hence could be included as a metadata description 
of the Requester. Similarly, the inclusion of session in- 
formation as a more expressive proxy to the Requester 
than the IP address is possible through the definition 
of a Metadata Format or identifier scheme. However, 
due to privacy concerns it is likely that less rather than 
more Requester information would be made available 
when sharing usage data beyond the boundaries of an 
institution, or that such information would become en- 
crypted or anonymized. 

• The ServiceType is described by means of a Metadata 
Format registered in the OpenURL Registry^. This 
Metadata Format allows for the expression of one or 
more services that were actually requested by the user 
from the linking server about the scholarly item that 
is the Referent. The example indicates that the user 
requested the full-text of the article; other services 
that are expressible using this Metadata Format in- 
clude "abstract", "citation" and "holding". 

• The Resolver is described by means of an identifier, 
which is the HTTP address of the linking server at 
which the service request was targeted. Again, it can 
easily be imagined that this information would be en- 
crypted or anonymized if usage information is shared 
beyond an institution. 

• The Referrer is described by means of an identifier 
using the registered "sid" namespace of the info URI 
scheme. The identifier indicates that the user was nav- 
igating the Scopus service of Elsevier Science when re- 
questing services from the linking server. 
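
The list above mentions that Requester and Resolver identifiers might be anonymized before usage data leaves an institution. One possible way to do so, shown here only as a sketch, is a keyed hash: the same Requester always maps to the same opaque token, so co-usage analysis remains possible, while the IP address itself is not disclosed. The "urn:anon:" prefix, the function name and the key are hypothetical; the paper does not prescribe a particular technique.

```python
import hashlib
import hmac


def anonymize_requester(identifier, secret_key):
    """Map a Requester identifier to a stable, opaque token using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "urn:anon:" + digest.hexdigest()[:32]  # hypothetical ad-hoc URN prefix


key = b"institution-local-secret"  # hypothetical key, never shared with harvesters
token = anonymize_requester("urn:ip:63.236.2.100", key)
print(token)
```

Keeping the key local to the institution means that tokens from different institutions cannot be linked to each other or reversed by an aggregator that only sees the shared logs.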

2.3 Inter-Institutional Aggregation of Usage 
Data 

Usage data pertaining to the scholarly information col- 
lection of a specific institution is a valuable asset for those 
institutions that choose to record and exploit it. For exam- 
ple, it allows management to track usage as it occurs and 
to make accurate and community-driven collection devel- 
opment decisions [4]. It can also be used in recommender 
services to enhance the discovery capabilities of users. 

However, increased value of usage data can be realized 
when it is aggregated over a large number of information 
sources and communities such that a representative sam- 
ple of the activities of the scholarly community (or a well- 
defined subset thereof) is obtained. If such aggregation can 
be achieved, applications can be imagined that discover, an- 
alyze and predict trends in the scholarly endeavor; recom- 
mender systems can be built that are based on the activ- 
ities of a global scholarly user base; and a new generation 
of usage-based impact and quality metrics can be defined, 
deployed and used to balance the monopoly of the citation- 
based ISI Impact Factor [8, 3]. 

To allow for the emergence of large-scale collections of us- 
age data, mechanisms for exchange and aggregation need to 
be devised. In the proposed approach, linking servers are 
used as intra-institutional aggregators of usage information. 
In order to allow for the inter-institutional aggregation of 
usage information, an OAI-PMH based technique is proposed 
in which an OAI-PMH repository with the following core 
properties exposes the usage logs of an institutional linking 
server: 

• Contained records are XML ContextObjects only. Each 
ContextObject represents an event recorded by the 
linking server as explained above. 



info:ofi/fmt:xml:xsd:sch_svc 



<?xml version="1.0" encoding="UTF-8"?> 
<ctx:context-object 
    timestamp="2005-11-11T17:45:08Z" 
    identifier="urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062"> 
  <!-- timestamp: event date and time; identifier: global event ID --> 
  <!-- referent data --> 
  <ctx:referent> 
    <!-- referent identifier --> 
    <ctx:identifier>info:doi/10.1016/j.ipm.2005.03.024</ctx:identifier> 
    <!-- referent metadata --> 
    <ctx:metadata-by-val> 
      <ctx:format>info:ofi/fmt:xml:xsd:journal</ctx:format> 
      <ctx:metadata> 
        <jou:journal> 
          <jou:atitle>Toward alternative metrics of journal impact</jou:atitle> 
          <jou:jtitle>Information Processing and Management</jou:jtitle> 
        </jou:journal> 
      </ctx:metadata> 
    </ctx:metadata-by-val> 
  </ctx:referent> 
  <!-- requester ID --> 
  <ctx:requester> 
    <ctx:identifier>urn:ip:63.236.2.100</ctx:identifier> 
  </ctx:requester> 
  <!-- type of request --> 
  <ctx:service-type> 
    <!-- service type metadata --> 
    <ctx:metadata-by-val> 
      <ctx:format>info:ofi/fmt:xml:xsd:sch_svc</ctx:format> 
      <ctx:metadata> 
        <full-text>yes</full-text> 
      </ctx:metadata> 
    </ctx:metadata-by-val> 
  </ctx:service-type> 
  <!-- resolver ID --> 
  <ctx:resolver> 
    <ctx:identifier>http://sfx.example.org/menu</ctx:identifier> 
  </ctx:resolver> 
  <!-- referrer ID --> 
  <ctx:referrer> 
    <ctx:identifier>info:sid/elsevier.com:scopus</ctx:identifier> 
  </ctx:referrer> 
</ctx:context-object> 







Table 2: Abbreviated sample demonstrating the representation of usage data as OpenURL ContextObjects 



• The identifier used by the OAI-PMH is the globally 
unique UUID that unambiguously identifies a link- 
ing server event; it is the same as the value of the 
identifier attribute to the root element of the XML 
ContextObject. 

• The datestamp used by the OAI-PMH is the datetime 
the event was uploaded to the log repository. Because 
it is expected that the log repository will be a derivative 
of the logs as stored by the linking server, the OAI-PMH 
datestamp does not coincide with the datetime 
of the event as provided in the timestamp attribute 
to the root element of the XML ContextObject. It 
should be noted that the datestamp of a record never 
changes, as an event will never be updated once it has 
been recorded and uploaded to the log repository. As 
a result, once harvested by a usage data aggregator, a 
record will not have to be re-harvested. 

• The only supported metadata format is the XML ContextObject 
Format (with metadataPrefix resolver_logs), 
registered in the OpenURL Registry^ along with an 
XML Schema definition^. 

• The harvesting granularity can be either at the day- 
level or the seconds-level. 

• No set structure is provided. 

Table 3 shows an example of an OAI-PMH record that 
contains the ContextObject of Table 2. 

This OAI-PMH-based approach allows harvesters to recurrently 
collect usage data as recorded by institutional linking 
servers, and to compile a usage data collection with a 
global or regional reach. The creation of aggregated col- 
lections with a different focus, such as a discipline-specific 
aggregation, would either require post-processing of har- 
vested logs, or the introduction of conventions regarding 
discipline-oriented set structures at the level of the log repos- 
itories. The latter has proven to be problematic for OAI- 
PMH repositories that expose Dublin Core metadata, and, 
given its noisy nature, may turn out to be unrealistic to 
achieve for usage data. 
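
Recurrent harvesting of a log repository can be sketched as follows. The `verb`, `metadataPrefix`, `from` and `resumptionToken` arguments are those defined by OAI-PMH 2.0; the repository address, the metadataPrefix value and the helper names are assumptions made for this example, and the sample response is a hand-written stand-in for what a repository might return.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix, from_date=None, token=None):
    """Build a ListRecords request; a resumptionToken excludes other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records added since then
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return the record identifiers on one response page and the next token."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return ids, (token or None)


sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header>"
    "<identifier>urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062</identifier>"
    "<datestamp>2005-11-12T21:21:51Z</datestamp>"
    "</header></record><resumptionToken/></ListRecords></OAI-PMH>"
)
url = list_records_url("http://sfx.example.org/oai", "resolver_logs", "2005-11-12")
ids, next_token = parse_page(sample)
print(url, ids, next_token)
```

Since the datestamp of an event record never changes, a harvester only needs to remember the date of its last successful harvest and pass it as `from` on the next run.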

2.4 Service Provision 

Once usage data is aggregated across the boundaries of 
a single information service and a single institution, as fa- 
cilitated by the aforementioned approach, services can be 

created on the basis of the aggregated usage data collec- 
tion. A major attraction of the proposed approach is that 
many aggregators can emerge, each of which could experi- 
ment with the creation of yet unknown services by mining 
the usage data collection in a variety of ways. As described 
in the following Section, our experiments have so far mainly 
focused on the creation of a recommender system and on the 
extraction of metrics that may eventually be attractive for 
the assessment of the quality and impact of scholarly works. 

3. RESULTS 

As a proof of concept, the described architecture was im- 
plemented in conjunction with the most widely deployed 




Figure 3: Exposing and harvesting linking server 
logs using OAI-PMH and OpenURL ContextOb- 
jects. 



Table 3: Sample OAI-PMH record containing 
OpenURL ContextObject 



commercial linking server, the SFX server from Ex Libris. 
To that end, an autonomous usage data add-on to the SFX 
linking server software with the following capabilities was 
implemented: 

1. The add-on can ingest the linking server's usage data 
into a special purpose log database. 

2. The linking server's log database is exposed as an 
OAI-PMH-compliant log repository with characteristics as 
described above. 



http://www.exlibrisgroup.com/sfx.htm 



3. The add-on contains an OAI-PMH usage data har- 
vester which has the ability to collect usage data from 
remote OAI-PMH-compliant log repositories, and to 
merge the harvested usage data with the linking server's 
own usage data. Care is taken not to re-expose usage 
data that was obtained through harvesting from re- 
mote linking servers. 
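
The third capability, merging harvested events with local ones without re-exposing them, can be illustrated with a small in-memory store. This is a hypothetical sketch rather than the add-on's actual design: the class and method names are invented, and a real implementation would sit on top of the log database.

```python
class LogStore:
    """Keeps local and harvested usage events apart by recording their origin."""

    def __init__(self):
        self._events = {}  # event UUID -> (origin, record)

    def ingest_local(self, event_id, record):
        self._events[event_id] = ("local", record)

    def ingest_harvested(self, event_id, record):
        # A harvested copy never replaces an event that is already stored.
        self._events.setdefault(event_id, ("harvested", record))

    def all_events(self):
        """Merged collection, used for analysis and services."""
        return {eid: rec for eid, (_, rec) in self._events.items()}

    def exposable_events(self):
        """Only locally recorded events are exposed through OAI-PMH."""
        return {
            eid: rec
            for eid, (origin, rec) in self._events.items()
            if origin == "local"
        }


store = LogStore()
store.ingest_local("urn:UUID:a", "<local event>")
store.ingest_harvested("urn:UUID:b", "<remote event>")
store.ingest_harvested("urn:UUID:a", "<remote copy of a local event>")
print(sorted(store.all_events()), sorted(store.exposable_events()))
```

Keying the store on the globally unique event identifier also prevents the same event from being counted twice if it arrives through more than one harvesting path.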

To gain experience with building services on the basis 
of a large log collection, usage data was aggregated 
across the California State University (CalState) campuses. 
The CalState system is one of the largest systems of pub- 
lic universities in the US. It comprises 23 campuses and 
seven off-campus centers which combined have a popula- 
tion of 409,000 students and 44,000 faculty and staff. The 
CalState system has deployed SFX live since Fall 2002 and 
uses an SFX consortium model consisting of 23 SFX linking 
servers (one per campus) and 1 for shared resources (oper- 
ated by the Chancellor's Office). For reasons of scale and its 
long-standing use of linking servers the CalState data offered 
a unique testing opportunity. Our initial experiences with 
building services on the basis of a large usage data collection 
are discussed in the remainder of this section. 

3.1 Aggregated Usage Data Collection 

Usage data from 9 CalState linking servers were included 
for an initial analysis. These 9 linking servers were se- 
lected because their IP-address distributions suffered the 
least from IP-address distortions caused mainly by the re- 
liance on proxy servers when requesting services from a link- 
ing server. When accessing a linking server through a proxy, 
the real IP address of the Requester is obscured^. 

The 9 selected instances were Chancellor, CPSLO, Los 
Angeles, Northridge, Sacramento, San Jose, San Marcos, 
SDSU, and finally SFSU. They represented the majority of 
linking server usage data in the CalState system. Usage data 
had been recorded at these institutions in the period 
November 11th, 2003 (10:44 AM) to August 18th, 2005 (11:43 PM). 

This data set was aggregated and loaded into the afore- 
mentioned add-on's special purpose log databases. The re- 
sulting collection consisted of 3,507,484 unique usage events. 
A de-duplication process run on the basis of the identifiers 
and the metadata describing the Referents (documents) in- 
volved in the events, resulted in a total of 2,133,556 unique 
Referents in the data set; the set contained 167,204 unique 
Requesters when using the user agent's IP-address as the 
Requester's identifier. A majority, i.e. 67%, of the Referents 
were journal articles. 
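
The counts above depend on how Referents are de-duplicated. The exact procedure is not detailed here; the following sketch shows one plausible approach, assuming that an identifier is preferred when present and that normalized title and journal metadata are used otherwise. The field names and the normalization rules are hypothetical.

```python
def referent_key(event):
    """Derive a de-duplication key from a Referent's identifier or metadata."""
    if event.get("identifier"):
        return ("id", event["identifier"].strip().lower())
    # Fall back to case- and whitespace-normalized metadata.
    title = " ".join(event.get("atitle", "").lower().split())
    journal = " ".join(event.get("jtitle", "").lower().split())
    return ("meta", title, journal)


def summarize(events):
    """Return (number of events, unique Referents, unique Requesters)."""
    referents = {referent_key(e) for e in events}
    requesters = {e["requester"] for e in events}
    return len(events), len(referents), len(requesters)


events = [
    {"requester": "urn:ip:63.236.2.100",
     "identifier": "info:doi/10.1016/j.ipm.2005.03.024"},
    {"requester": "urn:ip:63.236.2.101",
     "identifier": "INFO:DOI/10.1016/J.IPM.2005.03.024 "},
    {"requester": "urn:ip:63.236.2.100",
     "atitle": "Toward  alternative metrics of journal impact",
     "jtitle": "Information Processing and Management"},
]
print(summarize(events))
```

A limitation of this simple key is that an event described only by metadata is not matched against an event for the same work that carries only an identifier, so the number of unique Referents may be overestimated.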

3.2 Mining item relationships from CalState 
usage data 

Usage data naturally consists of a temporally ordered but 
otherwise unconnected sequence of usage events. In order 
to perform more sophisticated, network-based methods of 
Referent ranking and to create recommender services able 
to link one Referent to the other, a network of item rela- 
tionships needed to be extracted from the CalState usage 
data. It was therefore subjected to a technique similar to 
that employed by amazon.com which relates products if they 
have been purchased by the same users. Analogously, we 
related pairs of Referents in the CalState usage data accord- 

^Later modifications replaced the IP address with an 
anonymous session ID to avoid this issue. 



<oai:record> 
  <oai:header> 
    <oai:identifier>urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062</oai:identifier> 
    <oai:datestamp>2005-11-12T21:21:51Z</oai:datestamp> 
  </oai:header> 
  <oai:metadata> 
    <ctx:context-object 
        identifier="urn:UUID:58f202ac-22cf-11d1-b12d..." 
        timestamp="2005-11-11T17:45:08Z"> 
      <ctx:referent> 
        <ctx:identifier>info:doi/10.1016/j.ipm.2005.03.024</ctx:identifier> 
      </ctx:referent> 
    </ctx:context-object> 
  </oai:metadata> 
</oai:record> 
Table 4: Sample of journal usage matrix. 



ing to the degree to which they were consistently accessed 
by the same users. More details on this technique, which is 
highly related to collaborative filtering [17] and association 
rule learning [16], can be found in [3]. 
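
A much simplified version of this idea is sketched below: two Referents are related with a weight equal to the number of distinct Requesters who accessed both. This is an illustration of the general co-usage principle only; the actual weighting used for the CalState data is described in [3], and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_usage(events):
    """Weight each Referent pair by the number of Requesters who used both.

    `events` is an iterable of (requester, referent) tuples.
    """
    by_requester = defaultdict(set)
    for requester, referent in events:
        by_requester[requester].add(referent)  # repeated requests count once

    weights = defaultdict(int)
    for referents in by_requester.values():
        for a, b in combinations(sorted(referents), 2):
            weights[(a, b)] += 1
    return dict(weights)


events = [
    ("u1", "A"), ("u1", "B"), ("u1", "A"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
]
relations = co_usage(events)
print(relations)
```

The resulting dictionary is a sparse representation of a symmetric relationship matrix, with one entry per unordered pair of Referents.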

This procedure resulted in a network of Referent relationships 
represented by a Referent relationship matrix. Since 
usage is not bound by Referent type (journals, articles, etc.), 
Referents were not differentiated in the network, and 
journals could therefore be related to articles and vice versa. 
As a starting point for our preliminary data analysis we 
needed a more focused network. Therefore, a network of re- 
lationships between journals was generated by aggregating 
all journal article relationships between articles published 
in the same journal. The resulting journal relationship net- 
work pertained to a set of 45,554 journals for which 1,927,506 
journal relationships were derived. Table 4 shows a sample 
of the matrix representing this journal level network; each 
entry in the matrix reflects the strength of the relationship 
as inferred from the usage data collection. 
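
The aggregation step can be illustrated with a short sketch. It assumes that article-level relationships are held as a dictionary of pair strengths and that a lookup table from article to journal is available; both structures, the function name, and the choice to ignore pairs within a single journal are assumptions made for this example rather than details given in the text.

```python
from collections import defaultdict


def aggregate_to_journals(article_weights, journal_of):
    """Sum article-level relationship strengths into journal-level strengths.

    article_weights: dict mapping (article_a, article_b) -> strength
    journal_of:      dict mapping article -> journal (hypothetical lookup table)
    """
    journal_weights = defaultdict(int)
    for (a, b), strength in article_weights.items():
        ja, jb = journal_of.get(a), journal_of.get(b)
        if ja is None or jb is None or ja == jb:
            continue  # assumption: skip unmapped articles and within-journal pairs
        key = tuple(sorted((ja, jb)))  # keep the pair unordered
        journal_weights[key] += strength
    return dict(journal_weights)


article_weights = {("a1", "b1"): 2, ("a2", "b1"): 1, ("a1", "a2"): 5}
journal_of = {"a1": "J1", "a2": "J1", "b1": "J2"}
print(aggregate_to_journals(article_weights, journal_of))
```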

3.3 Usage impact ranking 

The Google search engine uses the PageRank algorithm 
to determine the impact of web pages on the basis of how 
often they are linked to by high impact web pages [6, 2]. 
A page receiving many in-links from high-impact pages is 
assumed to be of high impact itself. Since a network of
journal relationships has been established from the
aforementioned CalState usage data, its journals can now be
ranked according to the same algorithm, which has proven
effective in web searches. The PageRank values calculated on
the basis of usage-defined journal networks are referred to
as Usage PageRank.
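
For illustration, a weighted PageRank can be computed by power iteration as in the sketch below. The damping factor of 0.85, the fixed iteration count, the uniform treatment of nodes without out-links, and the sample network are assumptions of this example; the text does not state which parameters were used for Usage PageRank.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    weights: dict mapping (source, target) -> positive edge weight.
    A symmetric usage network can be passed by listing each pair in both directions.
    """
    nodes = sorted({node for edge in weights for node in edge})
    out_total = {node: 0.0 for node in nodes}
    for (source, _), weight in weights.items():
        out_total[source] += weight

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if out_total[node] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), weight in weights.items():
            new_rank[target] += damping * rank[source] * weight / out_total[source]
        rank = new_rank
    return rank


network = {("A", "B"): 1.0, ("C", "B"): 1.0, ("B", "A"): 1.0}
for journal, score in sorted(pagerank(network).items(), key=lambda item: -item[1]):
    print(journal, round(score, 4))
```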

Table 5 lists the 10 highest scoring journals according to
their Usage PageRank and their corresponding 2003 citation
Impact Factors. The latter reflects the impact of a journal
according to the frequency with which its articles are cited
over a 2-year period, i.e. 2001 and 2002. A juxtaposition
of the Usage PageRank and the citation Impact Factor
can therefore reveal how the impact of a journal in the
CalState system deviates from its impact in the general
scholarly community. Journals whose Usage PageRank deviates
strongly from the citation Impact Factor are therefore
marked with a ★.

These results indicate that usage and citation indicators
of journal impact agree only for a number of top journals
such as Nature and Science. However, for a large group of
journals there exist significant deviations which correspond
to what one could assume is the institutional research focus
in the CalState system. In other words, the journals
"Nature" and "Science" are as important in the CalState
community as they are in the general scholarly community,
but the Journal of Advanced Nursing (J ADV NURS) is
much more important in the CalState community than its
general citation rates indicate.

The rankings on the basis of Usage PageRank offer the
enticing possibility of more accurately pinpointing the
dynamic preferences and interests of a particular user
community, in this case CalState. In addition, they confirm
earlier results obtained for the Los Alamos National
Laboratory user community [3]. One could speculate that with
increasing sample sizes these rankings could provide a global
indication of the status of scholarly communication items for
the entire scholarly community, thereby augmenting existing
citation-based methods.

rank   PRw      IF03    journal
...    ...      ...     NEW ENGL J MED
...    ...      ...     MED SCI SPORT EXER ★
...    ...      ...     J ADV NURS ★
10     39.123   5.692   AM J CLIN NUTR ★




Table 5: Journal PageRank (PRw) in CalState us- 
age data and 2003 citation Impact Factors (IF03). 



3.4 Journal level interest mapping 

The ranking of journals according to Usage PageRank offers
an informative view of the CalState community's
characteristics. A comparison with the 2003 citation Impact
Factors highlights journals of particular interest. To describe
the properties of this community in finer detail, a
geographical mapping of journal usage relationships can be
generated by a Principal Component Analysis [13]. Such a
mapping places journals in a 2-dimensional space according
to how similar or dissimilar their usage is. Fig. 4 shows the
resulting mapping of journal similarities derived from the
CalState usage data. The graph reveals three main clusters of
interests, namely a "news" (top left), a "psychology" (top
right) and a "public health and policy" (mid bottom) cluster.
This mapping forms a model of how CalState users interact
with their information services and can thus serve as the
basis for an analysis of user habits and interests. The fact
that a meaningful structure emerges in this mapping supports
the validity and quality of the aggregated linking server
usage data.
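
A two-dimensional mapping of this kind can be sketched with a small, dependency-free Principal Component Analysis. The example assumes that each journal is described by one row of the journal relationship matrix; the power-iteration solver, the function names and the toy matrix are choices made for this illustration and are not taken from [13] or from the reported implementation.

```python
import math
import random


def principal_axes(cov, k=2, iterations=500, seed=0):
    """Leading k eigenvectors of a symmetric matrix (power iteration with deflation)."""
    rng = random.Random(seed)
    n = len(cov)
    work = [row[:] for row in cov]
    axes = []
    for _ in range(k):
        vec = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            vec = [x / norm for x in nxt]
        eigenvalue = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        # Deflate: remove the component just found so the next pass finds the next axis.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * vec[i] * vec[j]
        axes.append(vec)
    return axes


def pca_2d(rows):
    """Project each row (one journal's usage profile) onto the first two components."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    axes = principal_axes(cov, k=2)
    return [[sum(c[j] * axis[j] for j in range(m)) for axis in axes] for c in centered]


# Toy relationship matrix: journals 0 and 1 are used together, as are journals 2 and 3.
usage = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
for journal, (x, y) in enumerate(pca_2d(usage)):
    print(journal, round(x, 3), round(y, 3))
```

In the toy data, the first coordinate separates the two groups of journals, which is the kind of cluster structure the mapping is meant to expose.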

3.5 Recommender Services 

The generated relationship networks encode which journals
(or any type of Referents) are related in their usage to
which others. They can therefore be used to recommend
documents whenever a user expresses interest in a specific
document (or set thereof). On the basis of the general
Referent relationship matrices, a prototype recommender service
was constructed which accepts a description of a journal or
article (identifier and/or metadata) as input and then scans
the relationship matrices for viable suggestions.
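
The lookup performed by such a service can be sketched as follows, assuming the relationship matrix is stored sparsely as a dictionary of pair strengths and that the query has already been resolved to a single Referent identifier. The function name, the data layout and the tie-breaking rule are hypothetical.

```python
def recommend(relationships, query, top_n=10):
    """Return the Referents most strongly related to the query Referent.

    relationships: dict mapping an unordered pair (a, b) -> relationship strength.
    """
    scores = {}
    for (a, b), strength in relationships.items():
        if a == query:
            scores[b] = scores.get(b, 0) + strength
        elif b == query:
            scores[a] = scores.get(a, 0) + strength
    # Strongest relationships first; ties are broken by identifier for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [referent for referent, _ in ranked[:top_n]]


relationships = {("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 1, ("D", "A"): 2}
print(recommend(relationships, "A"))
```

Because the scores come only from co-usage, a recommended item need not share any metadata terms with the query.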




Figure 4: Mapping of journals accessed in CalState 
system on the basis of usage. 



Fig. 5 shows a screenshot of the implemented prototype.
Tables 6 and 7 show the results obtained for two
recommendation requests. In the case of Table 6,
recommendations were requested for an article on the subject
of "Circadian rhythms". Among the top ten results, we indeed
find mostly articles strongly related to varying aspects of
circadian rhythms and the physiological aspects of biological
clocks. Table 7 lists the results generated for a query
relating to the issue of learning reading skills at an early age.
Indeed, all top 10 ranked recommendations relate to
education and schooling issues. Note that results are obtained
on the basis of usage data, not on the basis of term extraction.
For example, in Table 6 an article entitled "You talking to
me?" is returned as a valid recommendation, even though none
of its metadata items matches those of the query document.

Although these results do not represent a valid, quantita- 
tive analysis of the effectiveness of usage-based recommen- 
dations, they do serve as a promising pointer to the potential 
value of scholarly usage data for advanced end-user services. 
In fact, the principle of deriving recommender systems from 
usage data has already been widely validated in the litera- 
ture [17, 12] and we expect scholarly usage data to be no 
exception. 

4. ONGOING ISSUES 

A number of noteworthy issues related to the large-scale 
aggregation and exploitation of usage data were encountered 
in the course of the reported work. They are described in 
the remainder of this section. 

4.1 Linking server: representativeness 

There are drawbacks associated with the use of linking 
servers in a usage log aggregation framework relating to 
scope, scale and representativeness. Indeed, although OpenURL 
is widely supported by scholarly information services, sup- 
port is not universal, and especially new types of nodes in the 
scholarly communication environment such as Institutional 
Repositories and Dataset Repositories lag behind. Also,
information services present value-added services to users, the
use of which is only recorded at the level of the information
service itself, not at the level of the linking server. As a
result, the linking server logs do not capture all events related
to documents referenced in information services. Thus,
linking server logs may validly represent the actions of their
user base, but they will inevitably miss certain aspects of
usage. Future investigations need to focus on the definition
of sampling statistics to determine the representativeness of
linking server logs.

Figure 5: Generating usage-based recommendations

4.2 Referent deduplication 

When linking server logs are recorded and aggregated, it 
is of vital importance that usage events pertaining to the 
same or a different Referent are recognized as such, i.e. the 
aggregated usage data must be de-duplicated at the level of 
the Referents to avoid over- and undercounting which occurs 
when two Referents are falsely confounded or distinguished 
respectively. 

The issue of Referent de-duplication was approached by 
the introduction of a metadata-based de-duplication key that 
met the following criteria: 

1. The metadata components used in the construction of 
the de-duplication key must be available for a large ma- 
jority of the processed Referents. If not, many events 
would end up with empty fields in their keys and hence 
would lead to problematic de-duplication results. 

2. A maximum number of identical Referents and a
minimum number of dissimilar Referents should be joined.

To identify de-duplication key candidates, we adopted an
iterative procedure which selected those n-tuples of
metadata items which occurred in the highest number of
Referents. From these candidates a final key was selected which
offered the best pragmatic compromise between the availability
of metadata components and de-duplication results. The key
consisted of:

{issn, start.page, publication.year, M(article.title, 25)}

where M(article.title, 25) represents a fuzzy match
(Levenshtein distance) on the first 25 characters of the
article title.
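
A comparison based on this key can be sketched as below. The exact-match fields and the 25-character title prefix follow the key given above; the dictionary field names, the case folding, and in particular the distance threshold `max_distance=3` are assumptions of this example, since the text does not state which threshold was used.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                         # deletion
                               current[j - 1] + 1,                      # insertion
                               previous[j - 1] + (char_a != char_b)))   # substitution
        previous = current
    return previous[-1]


def same_referent(r1, r2, max_distance=3):
    """Apply the de-duplication key: exact fields plus a fuzzy title-prefix match."""
    for field in ("issn", "start_page", "publication_year"):
        if r1[field] != r2[field]:
            return False
    title1 = r1["article_title"][:25].lower()
    title2 = r2["article_title"][:25].lower()
    return levenshtein(title1, title2) <= max_distance  # hypothetical threshold


a = {"issn": "1471-003X", "start_page": "826", "publication_year": "2004",
     "article_title": "Circadian rhythms: How time flies"}
b = dict(a, article_title="Circadian rythms: how time flies")
print(same_referent(a, b))
```

Records with missing key fields would need separate handling, which is the concern raised by the first criterion above.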

It should be noted that the proposed architecture does 
not depend on the simple de-duplication approach described 



"R. Jones (2004) Circadian rhythms: How time flies. NAT REV NEUROSCI. 5(11), 826-827"

rank  recommendation
1     DA Golombek (2004). Signaling in the mammalia. NEUROCHEM INT 45(6), 929-936
2     H Okamura (2004). Clock genes in cell clock: Roles, Actions, and Mysteries. J BIOL RHYTHM 19(5), 388-399
3     JC Leloup (2004). Modeling the mammalian circadian clock: Sensitivity analysis... J THEOR BIOL 230(4), 541-562
4     N Allaman-Pillet (2004). Circadian regulation of islet genes involved in insulin... MOL CELL ENDOC 226(1-2), 59-66
5     S Panda (2004). It's all in the timing: Many Clocks, Many Outputs. J BIOL RHYTHM 19(5), 374-387
6     M Zatz (2004). You talking to me? J BIOL RHYTHM 19(4), 263-263
7     Jadwiga Giebultowicz (2004). Chronobiology: Biological Timekeeping. INT COMP BIOL 44(3), 266
8     M Shermer (2004). None so blind. SCI AM 290(3), 42-42
9     H Kobayashi (2004). Effect of feeding on peripheral circadian rhythms... GENES CELLS 9(9), 857-864
10    R Sitruk-Ware (2004). New progestogens - A review of their effects. DRUG AGING 21(13), 865-883



Table 6: Usage-based recommendations for "Circadian rhythms" query. 

"R. Gersten (2003) Teaching reading to early language learners. EDUC LEADERSHIP 60(7), 44-49"

rank  recommendation
1     A Thompson (2002). Maybe we can just be friends. EDUCATIONAL THEORY 52(3), 327-38
2     T Quiroga (2002). Phonological awareness in Spanish. J SCHOOL PSYCHOL 40(1), 85-111
3     S Linan-Thompson (2003). Effectiveness of supplemental reading instruction... ELEM SCHOOL J 103(3), 221-238
4     L Araujo (2002). The literacy development of kindergarten English-language... J RES CHILD EDUC 16(2), 232
5     L Morris (2001). Going through a bad spell: what young ESL learners... CAN MOD LANG REV 58(2), 273-286
6     SO Dahlgren (1996). Theory of mind in non-retarded children... J CHILD PSYCHOL PSYCH 37(6), 759-763
7     EG Cohen (2002). Can groups learn? TEACH COLL REC 104(6), 1045-1068
8     D Freeman (2000). Meeting the needs of English language learners. TALKING POINTS 12(1), 2-7
9     Z Lin (2002). Discovering EFL learner's perception of prior knowledge and its roles... J RES READ 25(2), 172-90
10    K Huie (2003). Learning to write in the Primary Grades: Experiences of... TESOL JOURNAL 12(1), 25-31



Table 7: Usage-based recommendations for "Teaching reading to early language learners" query.



here, and that alternative, and superior, schemes for the de- 
duplication of Referents can easily be integrated. Given the 
impact that the de-duplication process has on the aforemen- 
tioned services, it is imperative that this remains an impor- 
tant topic for future research. 

4.3 Agent deduplication 

The use of IP addresses to identify Requesters is prevalent,
but leads to noisy usage data due to the use of proxies,
localhost requests, and robots/crawlers^*. Fig. 6 shows how
the distribution of the frequency of the requests issued by
particular IP addresses is distorted by the use of proxies and
web robots and crawlers. The distribution follows a power
law only when the first 25 IP addresses are discarded. These
IP addresses were indeed shown to correspond to localhost
requests, proxies and robots/crawlers.

Three options were identified to mitigate this problem.
First, the contributions of usage events originating from
particular Requesters could be weighted inversely by their
frequency of occurrence, i.e. the more frequent an IP address in
the usage data, the smaller its contribution to the final usage
statistics. This solution has the advantage that no
predefined, manual filtering of usage data is required. Second,
a manual filtering based on knowledge of the local linking
server setup could be conducted. This is a highly effective
approach but cumbersome, and not scalable because of the
manual intervention. Third, a retooling of the linking server
to use random, unique and anonymous session ID cookies
rather than IP addresses could be adopted. Although this

^^Privacy concerns are discussed in the following section. 
Figure 6: Distribution of IP address frequency in
CalState logs.

solution is prevalent in the Web-based retail environment,
it still raises some controversy in the realm of scholarly
information services because it relies on the use of client-side
information, which raises privacy concerns.
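
The first of the three options can be sketched as follows. The example assumes one particular reading of "weighted inversely by their frequency of occurrence", namely that every event from a Requester seen n times carries weight 1/n, so that each Requester contributes a total of 1 regardless of its activity. Other weighting functions are equally compatible with the description; the function names and sample events are hypothetical.

```python
from collections import Counter, defaultdict


def inverse_frequency_weights(requesters):
    """Weight per event for each Requester: 1 / number of events it issued."""
    counts = Counter(requesters)
    return {requester: 1.0 / count for requester, count in counts.items()}


def weighted_usage(events):
    """Weighted request totals per Referent for (requester, referent) events."""
    events = list(events)
    weights = inverse_frequency_weights(requester for requester, _ in events)
    totals = defaultdict(float)
    for requester, referent in events:
        totals[referent] += weights[requester]
    return dict(totals)


# A proxy issuing four requests counts no more than a single ordinary user.
events = [("proxy", "A")] * 4 + [("user1", "B")]
print(weighted_usage(events))
```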



4.4 User privacy 

The collection and aggregation of usage data raises a
multitude of legal and policy issues which became apparent in
the reported development and evaluation. The foremost issue
is that of user identification, which has acquired a definite
relevance in light of the recent demands placed on major
search engines for reporting user actions. Although the
proposed architecture allows extensive user and institutional
data to be represented, this is not a requirement. Measures
of user identity protection can be adopted both on the
intra- and inter-institutional level, and accommodated by
the proposed architecture. In particular, the use of anonymous,
random user (or session) IDs to replace IP addresses
has been explored. In addition, modifications to the SFX
linking server have been proposed to allow the use of such
anonymous session IDs. Future research will need to focus
on the definition of approaches to further reduce the exposure
of sensitive user- and institution-related information.
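
As an illustration of replacing IP addresses with anonymous random IDs, the sketch below rewrites already recorded events before they are shared. This is an assumption made for the example: the modification discussed above assigns session IDs at the linking server itself, whereas the sketch pseudonymizes existing logs after the fact. The function name and the 16-byte token length are hypothetical choices.

```python
import secrets


def anonymize(events):
    """Replace each IP address with a random ID that carries no address information.

    events: list of (ip_address, referent) pairs. The IP-to-ID mapping exists only
    inside this call and is discarded afterwards, so IDs cannot be traced back.
    """
    session_ids = {}
    anonymized = []
    for ip_address, referent in events:
        if ip_address not in session_ids:
            session_ids[ip_address] = secrets.token_hex(16)  # 32 hex characters
        anonymized.append((session_ids[ip_address], referent))
    return anonymized


log = [("10.0.0.1", "A"), ("10.0.0.2", "B"), ("10.0.0.1", "C")]
for session_id, referent in anonymize(log):
    print(session_id, referent)
```

Events from the same address keep a common ID, so co-usage relationships can still be derived from the anonymized log.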



5. CONCLUSION 

We outlined an architecture for the large-scale recording,
representation and aggregation of DL usage data. This
architecture is standards-based and relies on the already large
base of installed linking servers and the widespread
adoption of OpenURL and OAI-PMH based services. We
discussed a recent evaluation of the architecture involving
usage data recorded for the entire CalState system during the
2004 and 2005 period. It has been demonstrated that such
logs provide an attractive starting point for services which
support the end-user's scholarly activities, such as
recommender systems, but likewise allow the scholarly community
to monitor usage at a high level of detail.

The discussed evaluation of the proposed architecture is 
based on a particular configuration of usage data sources, 
a particular aggregation model and a particular set of ser- 
vice provisions. It is therefore limited in its generalizability. 
Future research needs to be directed towards an investiga- 
tion of the scalability of different aggregation architectures, 
models to protect user privacy, mechanisms to ensure data 
validity and detect fraud, metrics to determine data repre- 
sentativeness and a range of technical issues associated with 
referent identification. In addition, a range of potential
applications of usage data can be further explored, e.g. metrics
of item impact, indicators of scholarly trends [7] and
end-user services beyond the recommender service discussed in
this paper.

Such a wide range of technical, scientific and policy issues
is associated with applications of usage data that they
cannot be addressed within the framework of a single research
project. A community effort is required to fully explore this
emerging domain. A community of scholars is slowly emerg- 
ing but needs to be further consolidated. For that reason, 
the authors seek to organize a meeting of the different stake- 
holders and scholars in this area with the support of a promi- 
nent funding agency. The objective of such a meeting will 
be the establishment of a common research agenda around 
which a science of usage data can coalesce and develop. 